Reference genome of the color polymorphic desert annual plant sandblossoms, Linanthus parryae

Abstract Sandblossoms, Linanthus parryae is a widespread annual plant species found in washes and sandy open habitats across the Mojave Desert and Eastern Sierra Nevada of California. Studies in this species have played a central role in evolutionary biology, serving as the first test cases of the shifting balance theory of evolution, models of isolation by distance, and metrics to describe the genetic structure of natural populations. Despite the importance of L. parryae in the development of landscape genetics and phylogeography, there are no genomic resources available for the species. Through the California Conservation Genomics Project, we assembled the first genome in the genus Linanthus. Using PacBio HiFi long reads and Hi-C chromatin conformation capture, we assembled 123 scaffolds spanning 1.51 Gb of the 1.96 Gb estimated genome, with a contig N50 of 18.7 Mb and a scaffold N50 of 124.8 Mb. This assembly, with a BUSCO completeness score of 88.7%, will allow us to revisit foundational ideas central to our understanding of how evolutionary forces operate in a geographic landscape. In addition, it will be a new resource to uncover adaptations to arid environments in the fragile desert habitat threatened by urban and solar farm development, climate change, and off-road vehicles.


Introduction
The annual desert wildflower Linanthus parryae is an iconic species of the Mojave Desert. In years with higher rainfall, the species germinates prolifically and covers the desert floor with its well-known display of white and blue flowers. The species is endemic to California and occurs from 360 to 2100 m in elevation. Its geographic range extends across the western Mojave Desert, north through the Great Basin Desert of the Owens Valley, and in scattered populations across the Tehachapi Mountains and the southern inner CoastRange north to Shell Creek in San Luis Obispo County (see Fig. 1; Jepson Flora Project 2022). The species can be abundant in years of high rainfall, while in dry years the seeds remain dormant in the soil seedbank. L. parryae germinates in the winter and completes its life cycle by May.
One of the best-known attributes of this small desert plant is the color polymorphism that it displays both within populations and across its range. Populations are white flowered, blue-purple flowered, or polymorphic with both blue and white flowers often co-occurring in the same local habitat patch. Based on an unpublished herbarium survey, we found that approximately 40% of populations were polymorphic, and that these populations were distributed throughout the range of the species.
This extreme level of flower color polymorphism in L. parryae has attracted the attention of evolutionary biologists for generations and served as a model in debates about the degree to which natural selection or genetic drift contribute to the evolution of adaptively important characters in nature (Schemske and Bierzychudek 2007;Ishida 2017). In a highly influential paper, Epling and Dobzhansky mapped the relative frequencies of blue and white flowers across populations in the western Mojave Desert and concluded that populations with different color morph frequencies must represent subtly differentiated "microgeographic races" with relatively restricted gene flow between them (Epling and Dobzhansky 1942). These data were then used to test Wright's shifting balance theory of evolution and assess the sometimes antagonistic roles of genetic drift and natural selection in maintaining intraspecific polymorphisms (Wright 1931(Wright , 1937. The L. parryae color polymorphism data were also used in the first empirical tests of models of isolation by distance (IBD) and were fundamental to the development of Wright's F statistics that describe the genetic structure of natural populations to infer migration frequencies, inbreeding coefficients, and other aspects of demographic history (Wright 1943). Later studies proposed that natural selection linked to variation in rain regimes is the main evolutionary force maintaining L. parryae flower color polymorphism Bierzychudek 2001, 2007). Given the importance of this species in these foundational studies, modern genomic approaches seem to be the inevitable next step in examining the genetic basis and Fig. 1. Linanthus parryae is a minute annual wildflower, widespread in the Mojave Desert. (A) Dissected L. parryae flowers, with the gynoecium, fused corolla and androecium, calyx, and subtending leaves separated and labeled. (B) Blue and white corolla polymorphic populations of L. parryae can be found throughout the species' range. (C) A map of the species range including locations of herbarium specimens and observation reports aggregated by Calflora.org. The location of the specimen used for the genome assembly is marked in red. (D) A sandy wash west of Owens Lake, Inyo County, California, representative of L. parryae habitat. molecular mechanisms underlying this flower color polymorphism in this classic evolutionary genetic model system. These studies could also be expanded to include other flower color polymorphic species in the genus Linanthus including L. dianthiflorus, L. killipii, L. bernardinus, and L. orcuttii. Exploring the genetic basis for polymorphisms across closely related species and the roles of selection, drift, and gene flow operating on their respective landscapes will shed light on how polymorphisms arise and are maintained in nature.
Despite the central role of L. parryae in the development of phylogeography, landscape genetics, and population genetics, its habitat is in peril. Large swaths of the Mojave Desert are being developed for housing and photovoltaic power grids and utilized by Off-Highway Vehicles (OHVs). These flat sandy areas often coincide with the habitat preferred by desert annuals including L. parryae. Additionally, increases in temperature associated with global climate change can lead to range shifts and local extinctions for desert species that already occupy the upper extremes of plant thermal tolerance (Osmond et al. 1987;Lenoir and Svenning 2015). For such desert species to survive a changing climate, they must shift their range much faster than species in coastal or mountainous regions, which may be difficult to achieve in the shallow climatic gradients that characterize much of the desert (Loarie et al. 2009). At the same time, wetter microhabitats seem to be correlated with the ability of desert species to tolerate higher temperatures (Curtis et al. 2016). The large range of L. parryae throughout arid areas of California coupled with its adaptation to fluctuating patterns of rainfall make it an ideal species to study the ways that small desert annuals adapt to climate change, and to inform decision-makers on the extent to which protected areas will effectively support the survival of desert species. Understanding how desert annual species respond to global climatic changes and habitat destruction is therefore crucial to the conservation and protection of the desert biome that occupies 38% of California (Mooney and Zavaleta 2016).
This study reports the first chromosome-level genome assembly of L. parryae. To our knowledge, this is the second chromosome-level genome in the family Polemoniaceae (Jarvis et al. 2022), a group of significant historical importance in plant biosystematics and evolutionary biology (Grant and Grant 1965). This project was completed as part of the California Conservation Genomics Project (CCGP), an initiative with the goal of assembling genomic resources of endemic species in the state to inform conservation and management efforts Shaffer et al. 2022). This reference genome will provide crucial resources to study the genetic mechanisms underlying and maintaining flower color polymorphism in the species. It will also serve as the foundation for studies identifying hotspots of genetic diversity and connectivity, assessing the genomic health of L. parryae, and investigating the genomic basis of adaptation to extreme environments.

Biological materials
We collected L. parryae individuals from a population with only white-flowered morphs. The 2 individuals sequenced in this project were collected on 17 May 2020, in the town of Phelan in the Mojave Desert, San Bernardino County, California. The location of the population was roadside, north of Muscatel Street, west of the intersection with Windemere Road, near the powerline, approximately 1 mile south of Phelan Road (34.412122° N, −117.493026° W). We collected entire plants, including leaves, flowers, stems, and roots. The plants were in the budding and flowering developmental stage. Immediately after collection, we placed each individual in separate 15 mL Nalgene bottles and kept the bottles in a liquid nitrogen dewar while in the field. The 2 individual plants were then stored in a −80 °C freezer. Individual plant IGA184.5 was shipped to the University of California Davis for high molecular weight (HMW) DNA extractions and Pacific BioSciences HiFi library preparation and sequencing (Pacific BioSciences-PacBio, Menlo Park, CA). Individual plant IGA184.4 was sent to University of California Santa Cruz for Omni-C library preparation and sequencing. A voucher for this population was previously collected (accession number UCR-112367) and it is stored at the herbarium of University of California, Riverside (UCR).

Omni-C library preparation and sequencing
The Omni-C library was prepared using the Dovetail Omni-C Kit (Dovetail Genomics, Scotts Valley, CA) according to the manufacturer's protocol with slight modifications. First, specimen tissue from the whole plant (UCR112367; IGA184.4) was thoroughly ground with a mortar and pestle while cooled with liquid nitrogen. Nuclear isolation was then performed using published methods (Inglis et al. 2018). Subsequently, chromatin was fixed in place in the nucleus and digested under various conditions of DNase I until a suitable fragment length distribution of DNA molecules was obtained. Chromatin ends were repaired and ligated to a biotinylated bridge adapter followed by proximity ligation of adaptercontaining ends. After proximity ligation, crosslinks were reversed, and the DNA purified from proteins. Purified DNA was treated to remove biotin that was not internal to ligated fragments. An NGS library was generated using an NEB Ultra II DNA Library Prep kit (New England Biolabs, Ipswich, MA) with an Illumina-compatible y-adaptor, biotin-containing fragments were captured using streptavidin beads, and the postcapture product was split into 2 replicates prior to PCR enrichment to preserve library complexity, with each replicate receiving unique dual indices. The library was sequenced at Vincent J. Coates Genomics Sequencing Lab (Berkeley, CA) on an Illumina NovaSeq (Illumina, CA) platform to generate approximately 100 million 2 × 150 bp read pairs per Gb genome size.

DNA extraction
We extracted HMW genomic DNA (gDNA) from whole plant tissue of a single individual (600 mg; IGA184.5) using the method described in Inglis et al. (2018), with the following modifications. We used sodium metabisulfite (1%, w/v) instead of 2-mercaptoethanol (1%, v/v) in the sorbitol wash buffer and the lysis buffer. Using mortar and pestle, we pulverized the frozen tissue in liquid nitrogen for 15 min, then gently resuspended it in 10 mL of sorbitol wash buffer. The suspension was centrifuged at 3000 × g for 5 min at room temperature, and the supernatant was discarded. Using a paintbrush, we gently resuspended the ground tissue pellet in 10 mL of sorbitol wash buffer and repeated the wash step 5 times to remove potential contaminants that may coprecipitate with DNA. We performed the lysis step at 45 °C (instead of the standard 65 °C, to avoid potential heat-induced DNA damage) for 1 h with gentle inversion every 15 min. The DNA purity was estimated using absorbance ratios (260/280 = 1.80 and 260/230 = 2.24) on a NanoDrop ND-1000 spectrophotometer. The final DNA yield (84 ng/µL; 32 µg) was quantified using a Quantus Fluorometer (QuantiFluor ONE dsDNA Dye assay; Promega, Madison, WI). The size distribution of the HMW DNA was estimated using the Femto Pulse system (Agilent, Santa Clara, CA), where 52% of the DNA fragments were found to be 45 kb or longer.

HiFi library preparation and sequencing
The HiFi SMRTbell library was constructed using the SMRTbell Express Template Prep Kit v2.0 (PacBio, Cat. #100-938-900) according to the manufacturer's instructions. HMW gDNA was sheared to a target DNA size distribution between 15 and 20 kb. The sheared gDNA was concentrated using 0.45 of AMPure PB beads (PacBio, Cat. #100-265-900) for the removal of single-strand overhangs at 37 °C for 15 min, followed by further enzymatic steps of DNA damage repair at 37 °C for 30 min, end repair and A-tailing at 20 °C for 10 min and 65 °C for 30 min, ligation of overhang adapter v3 at 20 °C for 60 min and 65 °C for 10 min to inactivate the ligase, then nuclease treated at 37 °C for 1 h. The SMRTbell library was purified and concentrated with 0.45 Ampure PB beads (PacBio, Cat. #100-265-900) for size selection using the BluePippin/PippinHT system (Sage Science, Beverly, MA; Cat #BLF7510/HPE7510) to collect fragments greater than 7 to 9 kb. The 15 to 20 kb average HiFi SMRTbell library was sequenced at UC Davis DNA Technologies Core (Davis, CA) using 3 8M SMRT cells, Sequel II sequencing chemistry 2.0, and 30-h movies each on a PacBio Sequel II sequencer.

Nuclear genome assembly
We assembled the L. parryae genome following the CCGP assembly protocol Version 4.0, as outlined in Table 1 which lists the nondefault parameters used in the assembly. As with other CCGP assemblies, our goal was to produce a high-quality and highly contiguous assembly using PacBio HiFi reads and Omni-C data while minimizing manual curation. We removed remnant adapter sequences from the PacBio HiFi dataset using HiFiAdapterFilt (Sim et al. 2022) and obtained the dual or partially phased initial diploid assembly (http://lh3.github. io/2021/10/10/introducing-dual-assembly) using HiFiasm (Cheng et al. 2021). We tagged output haplotype 1 as the primary assembly, and output haplotype 2 as the alternate assembly. We aligned the Omni-C data to the assemblies by using the Arima Genomics Mapping Pipeline (https://github. com/ArimaGenomics/mapping_pipeline) and then scaffolded both assemblies with SALSA (Ghurye et al. 2017(Ghurye et al. , 2018. Next, we identified sequences corresponding to haplotypic duplications, contig overlaps and repeats on the primary assembly with purge_dups (Guan et al. 2020) and transferred them to the alternate assembly.
We generated Omni-C contact maps for both assemblies by aligning the Omni-C data against the corresponding assembly with BWA-MEM (Li 2013), identified ligation junctions, and generated Omni-C pairs using pairtools (Goloborodko et al. 2018). We generated a multiresolution Omni-C matrix with cooler (Abdennur and Mirny 2020) and balanced it with hicExplorer (Ramírez et al. 2018). We used HiGlass (Kerpedjiev et al. 2018) and the PretextSuite (https://github. com/wtsi-hpag/PretextView; https://github.com/wtsi-hpag/ PretextMap; https://github.com/wtsi-hpag/PretextSnapshot) to visualize the contact maps, then checked the contact maps for major misassemblies. In detail, if in the proximity of a join that was made by the scaffolder, we identified a strong signal off-diagonal and lack of signal in the consecutive genomic region, we marked this join. All the joins that were marked, were "dissolved", meaning that we broke the scaffolds at the coordinates of these joins. After this process, no further joins were made. Using the PacBio HiFi reads and YAGCloser (https://github.com/merlyescalona/yagcloser), we closed some of the remaining gaps generated during scaffolding. We then checked for contamination using the BlobToolKit Framework (Challis et al. 2020). Finally, we trimmed remnants of sequence adaptors and mitochondrial contamination identified during the contamination screening performed by NCBI.

Genome size estimation and quality assessment
We generated k-mer counts from the PacBio HiFi reads using meryl (https://github.com/marbl/meryl). The k-mer database was then used in GenomeScope 2.0 (Ranallo-Benavidez et al. 2020) to estimate genome features including genome size, heterozygosity, and repeat content. To obtain general contiguity metrics, we ran QUAST (Gurevich et al. 2013). To evaluate genome quality and completeness we used BUSCO (Manni et al. 2021) with the embryophyta ortholog database (embryophyta_odb10) which contains 1,614 genes. Assessment of base level accuracy (QV) and k-mer completeness was performed using the previously generated meryl database and merqury (Rhie et al. 2020). We further estimated genome assembly accuracy via BUSCO gene set frameshift analysis using the pipeline described in Korlach et al. (2017). We identified and annotated repeat sequences using RepeatModeler and RepeatMasker Hubley 2008-2015;Smit et al. 2013Smit et al. -2015.
Measurements of the size of the phased blocks are based on the size of the contigs generated by HiFiasm in HiC mode. We followed the quality metric nomenclature established by Rhie et al. (2021), with the genome quality code x·y·P·Q·C, where x = log10[contig NG50]; y = log10[scaffold NG50]; P = log10[phased block NG50]; Q = Phred base accuracy QV (quality value); C = % genome represented by the first "n" scaffolds, following a known karyotype 2n = 18 (Patterson 1979). Quality metrics for the notation were calculated on the primary assembly.

Results
The Omni-C and PacBio HiFi sequencing libraries generated 295.4 million read pairs and 4.8 million reads, respectively. The latter yielded 39.39-fold coverage (N50 read length 16,891 bp; minimum read length 48 bp; mean read length 16,034 bp; maximum read length 58,036 bp). Calculation of coverage is based on a flow cytometry estimated genome size of 1.96 Gb. The GenomeScope 2.0 genome size estimation was 1.6 Gb. Based on PacBio HiFi reads, we estimated a 0.182% sequencing error rate and 4.65% nucleotide heterozygosity rate. The k-mer spectrum based on PacBio HiFi reads ( Fig. 2A) shows a bimodal distribution with 2 major peaks at ~23-and ~45-fold coverage, where peaks correspond to homozygous and heterozygous states of a diploid species, respectively. The distribution presented in this k-mer spectrum supports that of a high heterozygosity profile.
The final assembly (ddLinParr1) consists of 2 pseudo haplotypes, primary and alternate, both genome sizes similar to the estimated value from GenomeScope 2.0 ( Fig. 2A) Table 2, and a graphical representation for the primary assembly in Fig. 2B (see Supplementary Fig. 1 for the alternate assembly). The Omni-C contact map suggests that the primary assembly is highly contiguous (Fig. 2C).
We identified a total of 12 misassemblies, 6 per assembly, and broke the corresponding joins made by SALSA2 on both assemblies. We were able to close a total of 22 gaps, 9 on the primary and 13 on the alternate assembly. Finally, we filtered 15 contigs from the primary assembly, 1 matching to Oomycota and the rest to Ascomycota contaminants. No further contigs were removed. The primary assembly has a BUSCO completeness score of 88.7% using the embryophyta gene set, a per base quality (QV) of 68.93, a k-mer completeness of 58.96 and a frameshift indel QV of 46.04. The alternate assembly has a BUSCO completeness score of 84.4% using the embryophyta gene set, a per base quality (QV) of 66.36, a k-mer completeness of 60.95 and a frameshift indel QV of 47.1. We have deposited scaffolds corresponding to both primary and alternate assemblies on NCBI (see Table 2 and Data availability for details).

Discussion
Prior to sequencing, we estimated the genome size of L. parryae using flow cytometry. Our 1C estimates were between 1.93 and 1.99 Gb. This difference in size compared with the GenomeScope estimation of 1.6 Gb may be due to repetitive elements in the genome. Based on flow cytometry, the species seems to have an average genome size compared with other Software citations are listed in the text.

Fig. 2.
Visual overview of the Linanthus parryae genome assembly metrics. (A) K-mer spectra output generated from PacBio HiFi data without adapters using GenomeScope 2.0. The bimodal pattern observed corresponds to a diploid genome and the k-mer profile matches that of high heterozygosity. K-mers at lower coverage and high frequency correspond to differences between haplotypes, whereas the higher coverage and low frequency k-mers correspond to the similarities between haplotypes. (B) BlobToolKit snail plot showing a graphical representation of the quality metrics presented in Table 2 for the Linanthus parryae primary assembly (ddLinParr1.0.p). The plot circle represents the full size of the assembly. From the inside-out, the central plot covers length-related metrics. The red line represents the size of the longest scaffold; all other scaffolds are arranged in size-order moving clockwise around the plot and drawn in gray starting from the outside of the central plot. Dark and light orange arcs show the scaffold N50 and scaffold N90 values. The central light gray spiral shows the cumulative scaffold count with a white line at each order of magnitude. White regions in this area reflect the proportion of Ns in the assembly; the dark versus light blue area around it shows mean, maximum and minimum GC versus AT content at 0.1% intervals (Challis et al. 2020). Hi-C contact maps for the primary (C) and alternate (D) genome assembly generated with PretextSnapshot. Hi-C contact maps translate proximity of genomic regions in 3D space to contiguous linear organization. Each cell in the contact map corresponds to sequencing data supporting the linkage (or join) between 2 of such regions. Scaffolds are separated by black lines and higher density of the lines may correspond to higher levels of fragmentation.
species in the genus. Flow cytometry genome size estimations of 9 other species of Linanthus ranged from 1.31 to 3.51 Gb, with an average of 2.02 Gb. This assembly is the second published genome in the family Polemoniaceae, and the first for Linanthus. The other chromosome-level genome assembly in the family is Gilia yorkii. Gilia and Linanthus last shared a common ancestor about 60 MYA (Landis et al. 2018). While these species have not shared an evolutionary history for a long period of time, we provide these metrics as a context for comparative genomics in Polemoniaceae. G. yorkii was sequenced using PacBio at 67× coverage compared with the L. parryae at 39.4× coverage. The genome size of G. yorkii was estimated at 2.80 Gb and the L. parryae at 1.96 Gb. The BUSCO completeness score for G. yorkii was 96.8% compared with the 88.7% in L. parryae. The L. parryae assembly has a longer contig N50 than the G. yorkii genome (18.7 vs. 2.5 Mb) and a shorter scaffold N50 (76.8 vs. 285.8 Mb). The G. yorkii reads were assembled in 3,947 contigs and 2,043 scaffolds, compared with 208 contigs and 123 scaffolds for L. parryae (Jarvis et al. 2022).
A total of 71.99% of the L. parryae genome was annotated as repetitive. This is similar to the values reported for G. yorkii (75.60%), but higher than the average of 45.49% across plant species (Luo et al. 2022). Long-terminal repeat (LTR) retroelements made up 38.26% of the genome, less ; Q = Phred base accuracy QV (quality value); C = % genome represented by the first "n" scaffolds, following a known karyotype 2n = 18 (Patterson 1979 (Luo et al. 2022). The RepeatMasker annotation of repeat elements for L. parryae is summarized in Table 3. This reference genome will contribute to conservation in California in several important ways. L. parryae occurs in a region of California considered a hotbed of plant diversity and neoendemism (Kraft et al. 2010). In concert with other species in the CCGP, this genomic resource will help determine whether these geographic areas also correspond to hotspots of genomic diversity and provide important information for setting realistic priorities for conservation. Although a desert species, L. parryae occurs across a relatively wide and heterogeneous environment gradient with an elevational range of 360 to 2100 m, a precipitation range of 50 to 500 mm/ yr, and a geographic range that spans 470 km. The morphological and environmental variation along this species distribution may correspond with genetic differentiation between populations and potential local adaptation. Hence, this genome can serve as a foundation to investigate patterns of genomic diversity, connectivity, and health for other California plants, as well as shed light on the sorting of adaptive genomic variation across ecological gradients that are often associated with species divergence.
L. parryae occurs in fragile habitats that are particularly prone to habitat destruction, with at least 5 major ongoing threats. First, large swaths of the desert are now utilized by OHVs, including 200,000 acres of newly designated motorized recreation land established by the California Desert Protection Act of 2019 (Feinstein 2020). Second, photovoltaic power grid development is increasing with projects proposed or being built on an additional 30,000 acres of California desert (Wilson 2020). These developments can, and will, drastically impact unprotected desert lands (Hernandez et al. 2015). Third, urban sprawl spilling over from the Los Angeles Basin has increased habitat destruction in the southwestern edge of the Mojave Desert (Stewart 1997), an area with the highest density of L. parryae. Fourth, as water resources become more scarce and seasonal streams are diverted, ephemeral wash habitat for Linanthus species, as well as other desert dwellers, may become uninhabitable (Levick et al. 2008). Lastly, ecological pressure from ongoing invasive plants will be exacerbated with climate change (Smith et al. 2000). This reference genome will help determine the effects of these threats and enhance conservation plans in currently protected desert areas, helping resource managers determine which natural areas to prioritize for future protection.

Confict of Interest
The authors declare that by publishing this manuscript they have no conflicts of interest.

Acknowledgments
PacBio Sequel II library prep and sequencing were carried out at the DNA Technologies and Expression Analysis Cores at the UC Davis Genome Center, supported by National Institutes of Health (NIH) Shared Instrumentation Grant 1S10OD010786-01. Deep sequencing of Omni-C libraries used the Novaseq S4 sequencing platforms at the Vincent J. Coates Genomics Sequencing Laboratory at UC Berkeley, supported by NIH S10 OD018174 Instrumentation Grant. We thank the staff at the UC Davis DNA Technologies and Expression Analysis Cores and the UC Santa Cruz Paleogenomics Laboratory for their diligence and dedication to generating high-quality sequence data. We also thank the members of the Zapata lab for providing feedback on the manuscript.

Data availability
Data generated for this study are available under NCBI BioProject PRJNA777190. Raw sequencing data for sample IGA184 (NCBI BioSample SAMN26264369) are deposited in the NCBI Short Read Archive (SRA) under SRX15304035 for PacBio HiFi sequencing data, and SRX15304036, SRX15304037 for the Omni-C Illumina sequencing data. GenBank accessions for both primary and alternate assemblies are GCA_023055425.1 and GCA_023055565.1; and for genome sequences JALGPY000000000 and JALGPZ000000000. Assembly scripts and other data for the analyses presented can be found at the following GitHub repository: www.github.com/ccgproject/ccgp_assembly.